1 MVP

  1. Load the diamonds.csv data set and undertake an initial exploration of the data. You will find a description of the meanings of the variables on the relevant Kaggle page.
library(tidyverse)
library(GGally)

diamonds <- read_csv("diamonds.csv")
## Warning: Missing column names filled in: 'X1' [1]
glimpse(diamonds)
## Rows: 53,940
## Columns: 11
## $ X1      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0…
## $ cut     <chr> "Ideal", "Premium", "Good", "Premium", "Good", "Very Good", "…
## $ color   <chr> "E", "E", "E", "I", "J", "J", "I", "H", "E", "H", "J", "J", "…
## $ clarity <chr> "SI2", "SI1", "VS1", "VS2", "SI2", "VVS2", "VVS1", "SI1", "VS…
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 6…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 5…
## $ price   <dbl> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 3…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2…
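Part of an initial exploration is checking for missing values. A minimal sketch of one way to do this with `across()`, using a toy tibble here so the chunk stands alone; on the homework data you would pipe `diamonds` itself into `summarise()`:

```r
library(tidyverse)

# Toy stand-in for the diamonds data, with a couple of NAs injected
df <- tibble(
  carat = c(0.23, NA, 0.31),
  cut   = c("Ideal", "Good", NA)
)

# Count the NAs in every column at once
df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```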
  1. We expect the carat of the diamonds to be strongly correlated with the physical dimensions x, y and z. Use ggpairs() to investigate correlations between these four variables.
ggpairs(diamonds[,c("carat", "x", "y", "z")])

  1. So, we do find significant correlations. Let’s drop columns x, y and z from the dataset, in preparation for using only carat going forward.
diamonds_tidy <- diamonds %>%
  select(-c("x", "y", "z"))

diamonds_tidy
  1. We are interested in developing a regression model for the price of a diamond in terms of the possible predictor variables in the dataset.
  1. Use ggpairs() to investigate correlations between price and the predictors (this may take a while to run, don’t worry, make coffee or something).
ggpairs(diamonds_tidy)

carat is strongly correlated with price. Boxplots of price split by cut, color and particularly clarity also show some variation.

  1. Perform further ggplot visualisations of any significant correlations you find.
diamonds_tidy %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

diamonds_tidy %>%
  ggplot(aes(x = cut, y = price)) +
  geom_boxplot()

diamonds_tidy %>%
  ggplot(aes(x = color, y = price)) +
  geom_boxplot()

diamonds_tidy %>%
  ggplot(aes(x = clarity, y = price)) +
  geom_boxplot()

  1. Shortly we may try a regression fit using one or more of the categorical predictors cut, clarity and color, so let’s investigate these predictors:
  1. Investigate the levels of these predictors. How many dummy variables do you expect for each of them?
unique(diamonds_tidy$cut)
## [1] "Ideal"     "Premium"   "Good"      "Very Good" "Fair"
# expect 4 dummies for cut

unique(diamonds_tidy$color)
## [1] "E" "I" "J" "H" "F" "G" "D"
# expect 6 dummies for color

unique(diamonds_tidy$clarity)
## [1] "SI2"  "SI1"  "VS1"  "VS2"  "VVS2" "VVS1" "I1"   "IF"
# expect 7 dummies for clarity
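The pattern above (a predictor with k levels needs k - 1 dummies, one level being absorbed as the reference) can be computed rather than counted by hand. A sketch on a toy tibble; on the homework data you would pass `diamonds_tidy` instead:

```r
library(tidyverse)

toy <- tibble(
  cut   = c("Ideal", "Premium", "Good"),
  color = c("E", "E", "J")
)

# Expected number of dummies per categorical column: distinct levels minus one
toy %>%
  summarise(across(where(is.character), ~ n_distinct(.x) - 1))
# cut = 2, color = 1
```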
  1. Use the dummy_cols() function in the fastDummies package to generate dummies for these predictors and check the number of dummies in each case.
library(fastDummies)
diamonds_dummies <- dummy_cols(diamonds_tidy, select_columns = c("cut", "clarity", "color"), remove_first_dummy = TRUE)
glimpse(diamonds_dummies)
## Rows: 53,940
## Columns: 25
## $ X1              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ carat           <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22,…
## $ cut             <chr> "Ideal", "Premium", "Good", "Premium", "Good", "Very …
## $ color           <chr> "E", "E", "E", "I", "J", "J", "I", "H", "E", "H", "J"…
## $ clarity         <chr> "SI2", "SI1", "VS1", "VS2", "SI2", "VVS2", "VVS1", "S…
## $ depth           <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1,…
## $ table           <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 5…
## $ price           <dbl> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339…
## $ cut_Good        <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
## $ cut_Ideal       <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,…
## $ cut_Premium     <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,…
## $ `cut_Very Good` <int> 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ clarity_IF      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ clarity_SI1     <int> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1,…
## $ clarity_SI2     <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,…
## $ clarity_VS1     <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ clarity_VS2     <int> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ clarity_VVS1    <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ clarity_VVS2    <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ color_E         <int> 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ color_F         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ color_G         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ color_H         <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ color_I         <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ color_J         <int> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,…
  1. Going forward we’ll let R handle dummy variable creation for categorical predictors in regression fitting (remember that lm() will generate the correct number of dummy levels automatically, absorbing one of the levels into the intercept as a reference level).
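A small self-contained illustration of this: `model.matrix()` exposes the design matrix that `lm()` builds, showing the first factor level absorbed into the intercept as the reference.

```r
# Toy factor with three levels; "Fair" comes first alphabetically,
# so by default it becomes the reference level
toy <- data.frame(cut = factor(c("Fair", "Good", "Ideal", "Good")))

model.matrix(~ cut, data = toy)
# Columns: (Intercept), cutGood, cutIdeal — there is no cutFair dummy
```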
  1. First, we’ll start with simple linear regression. Regress price on carat and check the regression diagnostics.
mod1 <- lm(price ~ carat, data = diamonds_tidy)
plot(mod1)

# the residuals show systematic variation, significant deviation from normality and heteroscedasticity. Oh dear...
  1. Run a regression with one or both of the predictor and response variables log() transformed and recheck the diagnostics. Do you see any improvement?
mod2_logx <- lm(price ~ log(carat), data = diamonds_tidy)
plot(mod2_logx)

mod2_logy <- lm(log(price) ~ carat, data = diamonds_tidy)
plot(mod2_logy)

mod2_logxlogy <- lm(log(price) ~ log(carat), data = diamonds_tidy)
plot(mod2_logxlogy)

# it looks like log transformation of both variables is required to get close to satisfying the regression assumptions.
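This improvement from the log-log fit is what we would expect if price follows a power law in carat with multiplicative noise. A self-contained sketch on synthetic data (all constants below are illustrative, not estimates from the diamonds data):

```r
set.seed(42)

# Synthetic power-law data: price ~ carat^1.7 with multiplicative noise
carat <- runif(500, 0.2, 2)
price <- 3000 * carat^1.7 * exp(rnorm(500, sd = 0.2))

raw_fit <- lm(price ~ carat)
log_fit <- lm(log(price) ~ log(carat))

# The log-log fit recovers the exponent (~1.7) as its slope, and its
# residuals are homoscedastic by construction, unlike raw_fit's
coef(log_fit)
plot(log_fit, which = 1)
```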
  1. Let’s use log() transformations of both predictor and response. Next, experiment with adding a single categorical predictor into the model. Which categorical predictor is best? [Hint - investigate \(r^2\) values]
mod3_cut <- lm(log(price) ~ log(carat) + cut, data = diamonds_tidy)
summary(mod3_cut)
## 
## Call:
## lm(formula = log(price) ~ log(carat) + cut, data = diamonds_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.52247 -0.16484 -0.00587  0.16087  1.38115 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.200125   0.006343 1292.69   <2e-16 ***
## log(carat)   1.695771   0.001910  887.68   <2e-16 ***
## cutGood      0.163245   0.007324   22.29   <2e-16 ***
## cutIdeal     0.317212   0.006632   47.83   <2e-16 ***
## cutPremium   0.238217   0.006716   35.47   <2e-16 ***
## cutVery Good 0.240753   0.006779   35.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2545 on 53934 degrees of freedom
## Multiple R-squared:  0.9371, Adjusted R-squared:  0.9371 
## F-statistic: 1.607e+05 on 5 and 53934 DF,  p-value: < 2.2e-16
mod3_color <- lm(log(price) ~ log(carat) + color, data = diamonds_tidy)
summary(mod3_color)
## 
## Call:
## lm(formula = log(price) ~ log(carat) + color, data = diamonds_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.49987 -0.14888  0.00257  0.15316  1.27815 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  8.572034   0.003051 2809.531  < 2e-16 ***
## log(carat)   1.728631   0.001814  952.727  < 2e-16 ***
## colorE      -0.025460   0.003748   -6.793 1.11e-11 ***
## colorF      -0.034455   0.003773   -9.132  < 2e-16 ***
## colorG      -0.055399   0.003653  -15.166  < 2e-16 ***
## colorH      -0.189859   0.003917  -48.468  < 2e-16 ***
## colorI      -0.286928   0.004383  -65.467  < 2e-16 ***
## colorJ      -0.425038   0.005417  -78.466  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2372 on 53932 degrees of freedom
## Multiple R-squared:  0.9454, Adjusted R-squared:  0.9454 
## F-statistic: 1.333e+05 on 7 and 53932 DF,  p-value: < 2.2e-16
mod3_clarity <- lm(log(price) ~ log(carat) + clarity, data = diamonds_tidy)
summary(mod3_clarity)
## 
## Call:
## lm(formula = log(price) ~ log(carat) + clarity, data = diamonds_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.97521 -0.12085  0.01048  0.12561  1.85854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.768115   0.006940 1119.25   <2e-16 ***
## log(carat)  1.806424   0.001514 1193.23   <2e-16 ***
## clarityIF   1.114625   0.008376  133.07   <2e-16 ***
## claritySI1  0.624558   0.007163   87.19   <2e-16 ***
## claritySI2  0.479658   0.007217   66.46   <2e-16 ***
## clarityVS1  0.820461   0.007306  112.30   <2e-16 ***
## clarityVS2  0.775248   0.007197  107.72   <2e-16 ***
## clarityVVS1 1.028298   0.007745  132.77   <2e-16 ***
## clarityVVS2 0.979221   0.007529  130.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 53931 degrees of freedom
## Multiple R-squared:  0.9654, Adjusted R-squared:  0.9654 
## F-statistic: 1.879e+05 on 8 and 53931 DF,  p-value: < 2.2e-16
# clarity leads to the model with the highest r^2, and all of its levels are significant predictors
  1. [Try this question if you have looked at the material on transformations] Interpret the fitted coefficients for the levels of your chosen categorical predictor. Which level is the reference level? Which level shows the greatest difference in price from the reference level? [Hints - remember we are regressing the log(price) here, and think about what the presence of the log(carat) predictor implies. We’re not expecting a mathematical explanation]
# 'I1' is the reference level for clarity. In multiple linear regression, the interpretation of any
# coefficient is the change in the response due to that predictor, holding the other predictors constant.
# Because the response here is log(price), each coefficient corresponds to a multiplicative
# (geometric) effect on price. Clarity 'IF' shows the greatest difference from the reference level.

# Here's how to interpret the `clarityIF` coefficient in the presence of the `log(price)` response

ratio <- exp(1.114625)
ratio
## [1] 3.048425
# so, on average, an IF diamond is priced at approx. 3 times that of an I1 diamond of the same carat.
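The same exponentiation can be applied to every clarity coefficient to get the full set of multiplicative effects relative to the I1 reference (values copied from the summary of the clarity model above):

```r
# Clarity coefficients, taken from summary(mod3_clarity)
clarity_coefs <- c(IF = 1.114625, SI1 = 0.624558, SI2 = 0.479658,
                   VS1 = 0.820461, VS2 = 0.775248,
                   VVS1 = 1.028298, VVS2 = 0.979221)

# Multiplicative price effect of each level vs an I1 diamond of equal carat
round(exp(clarity_coefs), 2)
```

Note the effects follow the quality ordering roughly (IF highest, SI2 lowest), as we would hope.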

2 Extension

  1. Try adding an interaction between log(carat) and your chosen categorical predictor. Do you think this interaction term is statistically justified?
mod4_clarity_inter <- lm(log(price) ~ log(carat) + clarity + log(carat):clarity, data = diamonds_tidy)
summary(mod4_clarity_inter)
## 
## Call:
## lm(formula = log(price) ~ log(carat) + clarity + log(carat):clarity, 
##     data = diamonds_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92773 -0.12104  0.01212  0.12465  1.51830 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            7.808102   0.007223 1080.98   <2e-16 ***
## log(carat)             1.528106   0.014944  102.25   <2e-16 ***
## clarityIF              1.122732   0.011381   98.65   <2e-16 ***
## claritySI1             0.587556   0.007465   78.71   <2e-16 ***
## claritySI2             0.438797   0.007486   58.62   <2e-16 ***
## clarityVS1             0.790989   0.007721  102.45   <2e-16 ***
## clarityVS2             0.723109   0.007530   96.03   <2e-16 ***
## clarityVVS1            1.007591   0.009506  106.00   <2e-16 ***
## clarityVVS2            0.968426   0.008359  115.85   <2e-16 ***
## log(carat):clarityIF   0.337116   0.017593   19.16   <2e-16 ***
## log(carat):claritySI1  0.288214   0.015254   18.89   <2e-16 ***
## log(carat):claritySI2  0.258795   0.015437   16.76   <2e-16 ***
## log(carat):clarityVS1  0.300307   0.015395   19.51   <2e-16 ***
## log(carat):clarityVS2  0.250187   0.015237   16.42   <2e-16 ***
## log(carat):clarityVVS1 0.301955   0.016317   18.51   <2e-16 ***
## log(carat):clarityVVS2 0.321665   0.015717   20.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1877 on 53924 degrees of freedom
## Multiple R-squared:  0.9658, Adjusted R-squared:  0.9658 
## F-statistic: 1.014e+05 on 15 and 53924 DF,  p-value: < 2.2e-16
# we obtain only a small improvement in r^2 from the interaction.
# The correct formal test is an anova() comparing the two nested models:
anova(mod3_clarity, mod4_clarity_inter)
# the small p-value indicates the interaction is statistically significant, although the
# practical improvement in fit is small.
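The nested-model F-test that `anova()` performs can be seen in miniature on synthetic data where the interaction genuinely matters (all names and constants below are illustrative):

```r
set.seed(1)

x <- runif(200)
g <- factor(sample(c("a", "b"), 200, replace = TRUE))
# Different slope per group, so an interaction term is truly needed
y <- ifelse(g == "a", 1 + 2 * x, 1 + 4 * x) + rnorm(200, sd = 0.3)

base_fit  <- lm(y ~ x + g)
inter_fit <- lm(y ~ x * g)

# F-test on the extra interaction term; expect a very small p-value
anova(base_fit, inter_fit)
```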
  1. Find and plot an appropriate visualisation to show the effect of this interaction
diamonds_tidy %>%
  ggplot(aes(x = log(carat), y = log(price), colour = clarity)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ clarity)
## `geom_smooth()` using formula 'y ~ x'

# visually the gradients differ only slightly between clarity levels, consistent with the
# statistically significant but small interaction effect found above